
Mixture of Gaussians (or Gaussian Mixture Model (GMM))

A GMM is a Gaussian latent variable model, $p(x) = \sum_z p(x, z)$, commonly used for clustering.

A mixture of $K$ multivariate Gaussian densities has the form $p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where the $\pi_k$ are called mixing coefficients and satisfy $\sum_{k=1}^K \pi_k = 1$ and $\pi_k \ge 0$ for all $k$.

Consider a latent variable $z \in \{1, \dots, K\}$ with $p(z = k) = \pi_k$ and $p(x \mid z = k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$, giving the joint density $p(x, z = k) = p(z = k)\, p(x \mid z = k) = \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$.

By Bayes' rule, we also have the posterior $p(z = k \mid x) = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}$.
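As a quick sketch of these quantities (the parameter values below are arbitrary and chosen only for illustration), the mixture density and the posterior responsibilities can be evaluated directly with `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component mixture in R^2 (parameters chosen arbitrarily).
pis = np.array([0.4, 0.6])                          # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

x = np.array([1.0, 1.0])

# p(x, z = k) = pi_k * N(x | mu_k, Sigma_k) for each component k
joint = np.array([pis[k] * multivariate_normal.pdf(x, mus[k], Sigmas[k])
                  for k in range(2)])

density = joint.sum()             # p(x) = sum_k p(x, z = k)
responsibility = joint / density  # p(z = k | x)
print(density, responsibility)
```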

For a dataset of $n$ points $X = \{x_1, \ldots, x_n\} \in \mathbb{R}^{n \times m}$ with latent assignments $z \in \{1, \dots, K\}^n$, the log-likelihood is $\log p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$.

Setting the derivatives of the log-likelihood to zero gives the MLE conditions $N_k = \sum_{i=1}^n p(z = k \mid x_i)$, $\mu_k = \frac{1}{N_k} \sum_{i=1}^n p(z = k \mid x_i)\, x_i$, $\Sigma_k = \frac{1}{N_k} \sum_{i=1}^n p(z = k \mid x_i)(x_i - \mu_k)(x_i - \mu_k)^T$, and $\pi_k = \frac{N_k}{n}$.

EM Algorithm

Since the MLE has no closed-form solution and the estimates depend on the posterior $p(z = k \mid x_i)$, we use the EM algorithm to estimate the parameters.

The steps are:

  • Initialize $\pi, \mu, \Sigma$.
  • E-step: for each $i, k$, compute the posterior $p(z = k \mid x_i) = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$.
  • M-step: re-estimate $\pi, \mu, \Sigma$ using the MLE formulas above.
  • Evaluate the log-likelihood and check for convergence.
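A minimal numpy sketch of these four steps (the function name `gmm_em` and the variable names are mine, not from the notes); it assumes `X` is an $n \times m$ data matrix and uses `scipy.stats.multivariate_normal` for the Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, tol=1e-6, seed=0):
    """Fit a K-component GMM to X (shape (n, m)) with EM; a minimal sketch."""
    n, m = X.shape
    rng = np.random.default_rng(seed)

    # Initialize pi, mu, Sigma.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]   # K random data points as means
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(m) for _ in range(K)])

    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k] = p(z_i = k | x_i).
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])          # shape (n, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate pi, mu, Sigma with the MLE formulas.
        Nk = gamma.sum(axis=0)                               # effective counts N_k
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # Small ridge term keeps the covariance positive definite.
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(m)

        # Log-likelihood under the parameters used in this E-step; check convergence.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma
```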

More generally, consider a setting with latent variables: an observed dataset $X \in \mathbb{R}^{N \times D}$ and latent variables $Z \in \mathbb{R}^{N \times K}$. We maximize the log-likelihood $\log p(X \mid \theta) = \log \left( \sum_Z p(X, Z \mid \theta) \right)$:

  • Initialize parameters $\theta_{old}$.
  • E-step: use $\theta_{old}$ to compute the posterior $p(Z \mid X, \theta_{old})$.
  • M-step: $\theta_{new} = \arg\max_{\theta} Q(\theta, \theta_{old})$, where $Q(\theta, \theta_{old}) = \sum_Z p(Z \mid X, \theta_{old}) \log p(X, Z \mid \theta) = E[\log p(X, Z \mid \theta) \mid X, \theta_{old}]$, which is tractable in many applications.
  • Replace $\theta_{old} \leftarrow \theta_{new}$. Repeat until convergence.
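The control flow is the same for any latent-variable model; this generic sketch assumes model-specific callables `e_step`, `m_step`, and `log_likelihood` are supplied by the caller (they are placeholders, not functions defined in these notes):

```python
def em(theta_old, e_step, m_step, log_likelihood, max_iters=100, tol=1e-6):
    """Generic EM loop: alternate posterior computation and Q-maximization."""
    prev_ll = float("-inf")
    for _ in range(max_iters):
        posterior = e_step(theta_old)     # p(Z | X, theta_old)
        theta_new = m_step(posterior)     # argmax_theta Q(theta, theta_old)
        ll = log_likelihood(theta_new)    # log p(X | theta_new)
        if abs(ll - prev_ll) < tol:
            return theta_new
        theta_old, prev_ll = theta_new, ll
    return theta_old
```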

If $z$ were observed, the MLE would be trivial:

$$\log p(X, Z \mid \theta) = \sum_{i=1}^n \log p(x_i, z_i \mid \theta) = \sum_{i=1}^n \sum_{k=1}^K \mathbb{I}(z_i = k) \log\big(\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\big)$$

For the E-step, the posterior factorizes as $p(Z \mid X, \theta) = \prod_{i=1}^n p(z_i \mid X, \theta)$ with $p(z_i = k \mid X, \theta) = p(z_i = k \mid x_i, \theta) = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$.

For the M-step, $E[\mathbb{I}(z_i = k) \mid X, \theta_{old}] = p(z_i = k \mid X, \theta_{old})$, and so $E[\log p(X, Z \mid \theta) \mid X, \theta_{old}] = \sum_{i=1}^n \sum_{k=1}^K p(z_i = k \mid X, \theta_{old}) \log\big(\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\big)$.
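Maximizing this expected complete-data log-likelihood recovers the MLE updates stated earlier. As a sketch (writing $\gamma_{ik} = p(z_i = k \mid X, \theta_{old})$ and $N_k = \sum_i \gamma_{ik}$), differentiating with respect to $\mu_k$ gives

$$\sum_{i=1}^n \gamma_{ik}\, \Sigma_k^{-1}(x_i - \mu_k) = 0 \;\Longrightarrow\; \mu_k = \frac{1}{N_k} \sum_{i=1}^n \gamma_{ik}\, x_i,$$

and the $\pi_k$ update follows similarly with a Lagrange multiplier enforcing $\sum_k \pi_k = 1$.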

K-means is a special case of the EM algorithm: it corresponds to hard (0/1) responsibilities with shared isotropic covariances $\Sigma_k = \epsilon I$ in the limit $\epsilon \to 0$.